;;; -*- Mode: Lisp; Package: CCL; Coding: utf-8; -*-

(chapter "Implementation Details of {CCL}"
  (para #:|This chapter describes many aspects of OpenMCL's
    implementation as of (roughly) version 1.1. Details vary a bit
    between the three architectures (PPC32, PPC64, and x86-64)
    currently supported and those details change over time, so the
    definitive reference is the source code (especially some files in
    the ccl/compiler/ directory whose names contain the string "arch"
    and some files in the ccl/lisp-kernel/ directory whose names
    contain the string "constants".) Hopefully, this chapter will make
    it easier for someone who's interested to read and understand the
    contents of those files.|)
  (defsection "Threads and exceptions"
    #:|{CCL}'s threads are "native" (meaning that they're
        scheduled and controlled by the operating system.)  Most of the
        implications of this are discussed elsewhere; this section tries
        to describe how threads look from the lisp kernel's perspective
        (and especially from the GC's point of view.)

        {CCL}'s runtime system tries to use machine-level
        exception mechanisms (conditional traps when available,
        illegal instructions, memory access protection in some cases)
        to detect and handle exceptional situations.  These situations
        include some TYPE-ERRORs and PROGRAM-ERRORS (notably
        wrong-number-of-args errors), and also include cases like "not
        being able to allocate memory without GCing or obtaining more
        memory from the OS."  The general idea is that it's usually
        faster to pay (very occasional) exception-processing overhead
        and figure out what's going on in an exception handler than it
        is to maintain enough state and context to handle an
        exceptional case via a lighter-weight mechanism when that
        exceptional case (by definition) rarely occurs.

        Some emulated execution environments (the Rosetta PPC
        emulator on x86 versions of Mac OS X) don't provide accurate
        exception information to exception handling functions. {CCL}
        can't run in such environments.|
    (defsection "The Thread Context Record"
      #:|When a lisp thread is first created (or when a thread
          created by foreign code first calls back to lisp), a data
          structure called a Thread Context Record (or TCR) is
          allocated and initialized.  On modern versions of Linux and
          FreeBSD, the allocation actually happens via a set of
          thread-local-storage ABI extensions, so a thread's TCR is
          created when the thread is created and dies when the thread
          dies.  (The World's Most Advanced Operating System-as
          Apple's marketing literature refers to Darwin-is not
          very advanced in this regard, and I know of no reason to
          assume that advances will be made in this area anytime
          soon.)

          A TCR contains a few dozen fields (and is therefore a
          few hundred bytes in size.)  The fields are mostly
          thread-specific information about the thread's stacks'
          locations and sizes, information about the underlying (POSIX)
          thread, and information about the thread's dynamic binding
          history and pending CATCH/UNWIND-PROTECTs.  Some of this
          information could be kept in individual machine registers
          while the thread is running (and the PPC - which has more
          registers available - keeps a few things in registers that the
          X86-64 has to access via the TCR), but it's important to
          remember that the information is thread-specific and can't
          (for instance) be kept in a fixed global memory
          location.

          When lisp code is running, the current thread's TCR is
          kept in a register.  On PPC platforms, a general purpose
          register is used; on x86-64, an (otherwise nearly useless)
          segment register works well (prevents the expenditure of a
          more generally useful general- purpose register for this
          purpose.)

          The address of a TCR is aligned in memory in such a way
          that a FIXNUM can be used to represent it.  The lisp function
          CCL::%CURRENT-TCR returns the calling thread's TCR as a
          fixnum; actual value of the TCR's address is 4 or 8 times the
          value of this fixnum.

          When the lisp kernel initializes a new TCR, it's added
          to a global list maintained by the kernel; when a thread
          exits, its TCR is removed from this list.

          When a thread calls foreign code, lisp stack pointers
          are saved in its TCR, lisp registers (at least those whose
          value should be preserved across the call) are saved on the
          thread's value stack, and (on x86-64) RSP is switched to the
          control stack.  A field in the TCR (tcr.valence) is then set
          to indicate that the thread is running foreign code, foreign
          argument registers are loaded from a frame on the foreign
          stack, and the foreign function is called. (That's a little
          oversimplified and possibly inaccurate, but the important
          things to note are that the thread "stops following lisp
          stack and register usage conventions" and that it advertises
          the fact that it's done so.  Similar transitions in a
          thread's state ("valence") occur when it enters or exits an
          exception handler (which is sort of an OS/hardware-mandated
          foreign function call where the OS thoughtfully saves the
          thread's register state for it beforehand.)|)
    (defsection "Exception contexts, and exception-handling in general"
      #:|Unix-like OSes tend to refer to exceptions as "signals";
          the same general mechanism ("signal handling") is used to
          process both asynchronous OS-level events (such as the result
          of the keyboard driver noticing that ^C or ^Z has been
          pressed) and synchronous hardware-level events (like trying to
          execute an illegal instruction or access protected memory.)
          It makes some sense to defer ("block") handling of
          asynchronous signals so that some critical code sequences
          complete without interruption; since it's generally not
          possible for a thread to proceed after a synchronous exception
          unless and until its state is modified by an exception
          handler, it makes no sense to talk about blocking synchronous
          signals (though some OSes will let you do so and doing so can
          have mysterious effects.)

          On OSX/Darwin, the POSIX signal handling facilities
          coexist with lower-level Mach-based exception handling
          facilities.  Unfortunately, the way that this is implemented
          interacts poorly with debugging tools: GDB will generally stop
          whenever the target program encounters a Mach-level exception
          and offers no way to proceed from that point (and let the
          program's POSIX signal handler try to handle the exception);
          Apple's CrashReporter program has had a similar issue and,
          depending on how it's configured, may bombard the user with
          alert dialogs which falsely claim that an application has
          crashed (when in fact the application in question has
          routinely handled a routine exception.)  On Darwin/OSX,
          {CCL} uses Mach thread-level exception handling facilities
          which run before GDB or CrashReporter get a chance to confuse
          themselves; {CCL}'s Mach exception handling tries to force
          the thread which received a synchronous exception to invoke a
          signal handling function ("as if" signal handling worked more
          usefully under Darwin.)  Mach exception handlers run in a
          dedicated thread (which basically does nothing but wait for
          exception messages from the lisp kernel, obtain and modify
          information about the state of threads in which exceptions
          have occurred, and reply to the exception messages with an
          indication that the exception has been handled.  The reply
          from a thread-level exception handler keeps the exception from
          being reported to GDB or CrashReporter and avoids the problems
          related to those programs.  Since {CCL}'s Mach exception
          handler doesn't claim to handle debugging-related exceptions
          (from breakpoints or single-step operations), it's possible to
          use GDB to debug {CCL}.

          On platforms where signal handling and debugging don't
          get in each other's way, a signal handler is entered with
          all signals blocked.  (This behavior is specified in the
          call to the sigaction() function which established the
          signal handler.)  The signal handler receives three
          arguments from the OS kernel; the first is an integer that
          identifies the signal, the second is a pointer to an object
          of type "siginfo_t", which may or may not contain a few
          fields that would help to identify the cause of the
          exception, and the third argument is a pointer to a data
          structure (called a "ucontext" or something similar), which
          contains machine-dependent information about the state of
          the thread at the time that the exception/signal occurred.
          While asynchronous signals are blocked, the signal handler
          stores the pointer to its third argument (the "signal
          context") in a field in the current thread's TCR, sets some
          bits in another TCR field to indicate that the thread is now
          waiting to handle an exception, unblocks asynchronous
          signals, and waits for a global exception lock that
          serializes exception processing.

          On Darwin, the Mach exception thread creates a signal
          context (and maybe a siginfo_t structure), stores the signal
          context in the thread's TCR, sets the TCR field which describes
          the thread's state, and arranges that the thread resume
          execution at its signal handling function (with a signal
          handler, possibly NULL siginfo_t, and signal context as
          arguments.  When the thread resumes, it waits for the global
          exception lock.

          On x86-64 platforms where signal handing can be used to
          handle synchronous exceptions, there's an additional
          complication: the OS kernel ordinarily allocates the signal
          context and siginfo structures on the stack of the thread
          that received the signal; in practice, that means "wherever
          RSP is pointing."  {CCL}'s
          {section Register and stack usage conventions}
          require that the thread's value stack-where RSP is
          usually pointing while lisp code is running-contain
          only "nodes" (properly tagged lisp objects), and scribbling
          a signal context all over the value stack would violate this
          requirement.  To maintain consistency, the sigaltstack()
          mechanism is used to cause the signal to be delivered on
          (and the signal context and siginfo to be allocated on) a
          special stack area (the last few pages of the thread's
          control stack, in practice).  When the signal handler runs,
          it (carefully) copies the signal context and siginfo to the
          thread's control stack and makes RSP point into that stack
          before invoking the "real" signal handler. The effect of
          this hack is that the "real" signal handler always runs on
          the thread's control stack.

          Once the exception handler has obtained the global
          exception lock, it uses the values of the signal number,
          siginfo_t, and signal context arguments to determine the
          (logical) cause of the exception.  Some exceptions may be
          caused by factors that should generate lisp errors or other
          serious conditions (stack overflow); if this is the case, the
          kernel code may release the global exception lock and call out
          to lisp code.  (The lisp code in question may need to repeat
          some of the exception decoding process; in particular, it
          needs to be able to interpret register values in the signal
          context that it receives as an argument.)

          In some cases, the lisp kernel exception handler may not
          be able to recover from the exception (this is currently true
          of some types of memory-access fault and is also true of traps
          or illegal instructions that occur during foreign code
          execution.  In such cases, the kernel exception handler
          reports the exception as "unhandled", and the kernel debugger
          is invoked.

          If the kernel exception handler identifies the
          exception's cause as being a transient out-of-memory condition
          (indicating that the current thread needs more memory to cons
          in), it tries to make that memory available.  In some cases,
          doing so involves invoking the GC.|)
    (defsection "Threads, exceptions, and the GC"
      #:|{CCL}'s GC is not concurrent: when the GC is invoked in
          response to an exception in a particular thread, all other
          lisp threads must stop until the GC's work is done.  The
          thread that triggered the GC iterates over the global TCR
          list, sending each other thread a distinguished "suspend"
          signal, then iterates over the list again, waiting for a
          per-thread semaphore that indicates that the thread has
          received the "suspend" signal and responded appropriately.
          Once all other threads have acknowledged the request to
          suspend themselves, the GC thread can run the GC proper (after
          doing any necessary {section PC-lusering}.)  Once the
          GC's completed its work, the thread that invoked the GC
          iterates over the global TCR list, raising a per-thread
          "resume" semaphore for each other thread.

          The signal handler for the asynchronous "suspend" signal
          is entered with all asynchronous signals blocked.  It saves
          its signal-context argument in a TCR slot, raises the tcr's
          "suspend" semaphore, then waits on the TCR's "resume"
          semaphore.

          The GC thread has access to the signal contexts of all
          TCRs (including its own) at the time when the thread received
          an exception or acknowledged a request to suspend itself.
          This information (and information about stack areas in the TCR
          itself) allows the GC to identify the "stack locations and
          register contents" that are elements of the GC's root
          set.|)
    (defsection "PC-lusering"
      (para "It's not quite accurate to say that {CCL}'s compiler
          and runtime follow precise stack and register usage
          conventions at all times; there are a few exceptions:")
      (listing :bullet
	(item "On both PPC and x86-64 platforms, consing isn't
	          fully atomic.It takes at least a few instructions to
	          allocate an object in memory(and slap a header on it if
	          necessary); if a thread is interrupted in the middle of
	          that instruction sequence, the new object may or may
	          not have been created or fully initialized at the point in
	          time that the interrupt occurred.  (There are actually a
	          few different states of partial initialization)")
	(item "On the PPC, the common act of building a lisp
	          control stack frame involves allocating a four-word frame
	          and storing three register values into that frame.  (The
	          fourth word - the back pointer to the previous frame - is
	          automatically set when the frame is allocated.)  The
	          previous contents of those three words are unknown (there
	          might have been a foreign stack frame at the same address a
	          few instructions earlier),so interrupting a thread that's
	          in the process of initializing a PPC control stack frame
	          isn't GC-safe.")
	(item "There are similar problems with the initialization
	          of temp stackframes on the PPC.  (Allocation and
	          initialization doesn't happen atomically, and the newly
	          allocated stack memory may have undefined contents.)")
	(item "{section The ephemeral GC}'s write barrier
	          has to be implemented atomically (i.e.,both an
	          intergenerational store and the update of a
	          corresponding reference bit has to happen without
	          interruption, or neither of these events can
	          happen.)")
	(item "There are a few more similar cases."))
      "Fortunately, the number of these non-atomic instruction
          sequences is small, and fortunately it's fairly easy for the
          interrupting thread to recognize when the interrupted thread
          is in the middle of such a sequence.  When this is detected,
          the interrupting thread modifies the state of the interrupted
          thread (modifying its PC and other registers) so that it is no
          longer in the middle of such a sequence (it's either backed
          out of it or the remaining instructions are emulated.)

          This works because (a) many of the troublesome
          instruction sequences are PPC-specific and it's relatively
          easy to partially disassemble the instructions surrounding the
          interrupted thread's PC on the PPC and (b) those instruction
          sequences are heavily stylized and intended to be easily
          recognized."))
  (defsection "Register usage and tagging"
    (defsection "Overview"
      #:|Regardless of other details of its implementation, a
	      garbage collector's job is to partition the set of all
	      heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.)
	      into two subsets.  The first subset contains all objects that
	      are transitively referenced from a small set of "root" objects
	      (the contents of the stacks and registers of all active
	      threads at the time the GC occurs and the values of some
	      global variables.)  The second subset contains everything
	      else: those lisp objects that are not transitively reachable
	      from the roots are garbage, and the memory occupied by garbage
	      objects can be reclaimed (since the GC has just proven that
 	      it's impossible to reference them.)

          The set of live, reachable lisp objects basically form
          the nodes of a (usually large) graph, with edges from each
          node A to any other objects (nodes) that object A
          references.

          Some nodes in this graph can never have outgoing edges:
          an array with a specialized numeric or character type usually
          represents its elements in some (possibly more compact)
          specialized way.  Some nodes may refer to lisp objects that
          are never allocated in memory (FIXNUMs, CHARACTERs,
          SINGLE-FLOATs on 64-bit platforms ..)  This latter class of
          objects are sometimes called "immediates", but that's a little
          confusing because the term "immediate" is sometimes used to
          refer to things that can never be part of the big connectivity
          graph (e.g., the "raw" bits that make up a floating-point
          value, foreign address, or numeric value that needs to be used
          - at least fleetingly - in compiled code.)

          For the GC to be able to build the connectivity graph
          reliably, it's necessary for it to be able to reliably tell
          (a) whether or not a "potential root" - the contents of a
          machine register or stack location - is in fact a node and (b)
          for any node, whether it may have components that refer to
          other nodes.

          There's no reliable way to answer the first question on
          stock hardware.  (If everything was a node, as might be the
          case on specially microcoded "lisp machine" hardware, it
          wouldn't even need to be asked.)  Since there's no way to just
          look at a machine word (the contents of a machine register or
          stack location) and tell whether or not it's a node or just
          some random non-node value, we have to either adopt and
          enforce strict conventions on register and stack usage or
          tolerate ambiguity.

          "Tolerating ambiguity" is an approach taken by some
          ("conservative") GC schemes; by contrast, {CCL}'s GC is
          "precise", which in this case means that it believes that the
          contents of certain machine registers and stack locations are
          always nodes and that other registers and stack locations are
          never nodes and that these conventions are never violated by
          the compiler or runtime system.  The fact that threads are
          preemptively scheduled means that a GC could occur (because of
          activity in some other thread) on any instruction boundary,
          which in turn means that the compiler and runtime system must
          follow precise {section Register and stack usage conventions} at all
          times.

          Once we've decided that a given machine word is a node,
          a {section Tagging scheme} describes how the node's
          value and type are encoded in that machine word.

          Most of this discussion-so far-has treated
          things from the GC's very low-level perspective. From a much
          higher point of view, lisp functions accept nodes as
          arguments, return nodes as values, and (usually) perform
          some operations on those arguments in order to produce those
          results.  (In many cases, the operations in question involve
          raw non-node values.)  Higher-level parts of the lisp type
          system (functions like TYPE-OF and CLASS-OF, etc.) depend on
          the {section Tagging scheme}.|)
    (defsection "pc-locatives on the PPC"
      #:|On the PPC, there's a third case (besides "node" and
          "immediate" values).  As discussed below, a node that denotes
          a memory-allocated lisp object is a biased (tagged) pointer
          -to- that object; it's not generally possible to point -into-
          some composite (multi-element) object (such a pointer would
          not be a node, and the GC would have no way to update the
          pointer if it were to move the underlying object.)

          Such a pointer ("into" the interior of a heap-allocated
          object) is often called a {emphasis locative}; the
          cases where locatives are allowed in {CCL} mostly involve
          the behavior of function call and return instructions.  (To be
          technically accurate, the other case also arises on x86-64, but
          that case isn't as user-visible.)

          On the PowerPC (both PPC32 and PPC64), all machine
          instructions are 32 bits wide and all instruction words are
          allocated on 32-bit boundaries.  In PPC {CCL}, a CODE-VECTOR
          is a specialized type of vector-like object; its elements
          are 32-bit PPC machine instructions.  A CODE-VECTOR is an
          attribute of a FUNCTION object; a function call involves
          accessing the function's code-vector and jumping to the
          address of its first instruction.

          As each instruction in the code vector sequentially
          executes, the hardware program counter (PC) register advances
          to the address of the next instruction (a locative into the
          code vector); since PPC instructions are always 32 bits wide
          and aligned on 32-bit boundaries, the low two bits of the PC
          are always 0.  If the function executes a call (simple call
          instructions have the mnemonic "bl" on the PPC, which stands
          for "branch and link"), the address of the next instruction
          (also a word-aligned locative into a code-vector) is copied
          into the special- purpose PPC "link register" (lr); a function
          returns to its caller via a "branch to link register" (blr)
          instruction.  Some cases of function call and return might
          also use the PPC's "count register" (ctr), and if either the
          lr or ctr needs to be stored in memory it needs to first be
          copied to a general-purpose register.

          {CCL}'s GC understands that certain registers contain
          these special "pc-locatives" (locatives that point into
          CODE-VECTOR objects); it contains special support for
          finding the containing CODE-VECTOR object and for adjusting
          all of these "pc-locatives" if the containing object is
          moved in memory.  The first part of that
          operation-finding the containing object-is
          possible and practical on the PPC because of architectural
          artifacts (fixed-width instructions and arcana of
          instruction encoding.)  It's not possible on x86-64, but
          fortunately not necessary either (though the second part -
          adjusting the PC/RIP when the containing object moves) is
          both necessary and simple.|)
    (defsection "Register and stack usage conventions"
      (defsection "Stack conventions"
	"On both PPC and X86 platforms, each lisp thread uses 3
            stacks; the ways in which these stacks are used differs
            between the PPC and X86.

Each thread has:"
	(listing :bullet
	  (item #:|A "control stack".  On both platforms, this is
	            "the stack" used by foreign code.  On the PPC, it
	            consists of a linked list of frames where the first word
	            in each frame points to the first word in the previous
	            frame (and the outermost frame points to 0.)  Some
	            frames on a PPC control stack are lisp frames; lisp
	            frames are always 4 words in size and contain (in
	            addition to the back pointer to the previous frame) the
	            calling function (a node), the return address (a
	            "locative" into the calling function's code-vector), and
	            the value to which the value-stack pointer (see below)
	            should be restored on function exit.  On the PPC, the GC
	            has to look at control-stack frames, identify which of
	            those frames are lisp frames, and treat the contents of
	            the saved function slot as a node (and handle the return
	            address locative specially.)  On x86-64, the control
	            stack is used for dynamic-extent allocation of immediate
	            objects.  Since the control stack never contains nodes
	            on x86-64, the GC ignores it on that platform.
	            Alignment of the control stack follows the ABI
	            conventions of the platform (at least at any point in
	            time where foreign code could run.)  On PPC, the r1
	            register always points to the top of the current
	            thread's control stack; on x86-64, the RSP register
	            points to the top of the current thread's control stack
	            when the thread is running foreign code and the address
	            of the top of the control stack is kept in the thread's
	            TCR (see {section The Thread Context Record}
	            when not running foreign code.  The control stack "grows
	            down."|)
	  (item #:|A "value stack".  On both platforms, all values on
	            the value stack are nodes (including "tagged return
	            addresses" on x86-64.)  The value stack is always
	            aligned to the native word size; objects are always
	            pushed on the value stack using atomic instructions
	            ("stwu"/"stdu" on PPC, "push" on x86-64), so the
	            contents of the value stack between its bottom and top
	            are always unambiguously nodes; the compiler usually
	            tries to pop or discard nodes from the value stack as
	            soon as possible after their last use (as soon as they
	            may have become garbage.)  On x86-64, the RSP register
	            addresses the top of the value stack when running lisp
	            code; that address is saved in the TCR when running
	            foreign code.  On the PPC, a dedicated register (VSP,
	            currently r15) is used to address the top of the value
	            stack when running lisp code, and the VSP value is saved
	            in the TCR when running foreign code.  The value stack
	            grows down.|)
	  (item #:|A "temp stack".  The temp stack consists of a
	            linked list of frames, each of which points to the
	            previous temp stack frame.  The number of native
	            machine words in each temp stack frame is always even,
	            so the temp stack is aligned on a two-word (64- or
	            128-bit) boundary.  The temp stack is used for
	            dynamic-extent objects on both platforms; on the PPC,
	            it's used for essentially all such objects (regardless
	            of whether or not the objects contain nodes); on the
	            x86-64, immediate dynamic-extent objects (strings,
	            foreign pointers, etc.)  are allocated on the control
	            stack and only node-containing dynamic-extent objects
	            are allocated on the temp stack.  Data structures used
	            to implement CATCH and UNWIND-PROTECT are stored on
	            the temp stack on both ppc and x86-64.  Temp stack
	            frames are always doublenode aligned and objects
	            within a temp stack frame are aligned on doublenode
	            boundaries.  The first word in each frame contains a
	            back pointer to the previous frame; on the PPC, the
	            second word is used to indicate to the GC whether the
	            remaining objects are nodes (if the second word is 0)
	            or immediate (otherwise.)  On x86-64, where temp stack
	            frames always contain nodes, the second word is always
	            0.  The temp stack grows down.  It usually takes
	            several instructions to allocate and safely initialize
	            a temp stack frame that's intended to contain nodes,
	            and the GC has to recognize the case where a thread is
	            in the process of allocating and initializing a temp
	            stack frame and take care not to interpret any
	            uninitialized words in the frame as nodes. The PPC
	            keeps the current top of the temp stack in a dedicated
	            register (TSP, currently r12) when running lisp code
	            and saves this register's value in the TCR when
	            running foreign code.  The x86-64 keeps the address of
	            the top of each thread's temp stack in the thread's
	            TCR.|)))
      (defsection "Register conventions"
	#:|If there are a "reasonable" (for some value of
            "reasonable") number of general-purpose registers and the
            instruction set is "reasonably" orthogonal (most
            instructions that operate on GPRs can operate on any GPR),
            then it's possible to statically partition the GPRs into at
            least two sets: "immediate registers" never contain nodes,
            and "node registers" always contain nodes.  (On the PPC, a
            few registers are members of a third set of "PC locatives",
            and on both platforms some registers may have dedicated
            roles as stack or heap pointers; the latter class is treated
            as immediates by the GC proper but may be used to help
            determine the bounds of stack and heap memory areas.)

            The ultimate definition of register partitioning is
            hardwired into the GC in functions like "mark_xp()" and
            "forward_xp()", which process the values of some of the
            registers in an exception frame as nodes and may give some
            sort of special treatment to other register values they
            encounter there.)

On x86-64, the static register partitioning scheme involves:|
	(listing :bullet
	  (item #:|(only) three "immediate" registers.

	            The RAX, RCX, and RDX registers are used as the
	            implicit operands and results of some extended-precision
	            multiply and divide instructions which generally involve
	            non-node values; since their use in these instructions
	            means that they can't be guaranteed to contain node
	            values at all times, it's natural to put these registers
	            in the "immediate" set. RAX is generally given the
	            symbolic name "imm0", RDX is given the symbolic name
	            "imm1" and RCX is given the symbolic name "imm2"; you
	            may see these names in disassembled code, usually in
	            operations involving type checking, array indexing, and
	            foreign memory and function access.|)
	  (item #:|(only) two "dedicated" registers.

	            RSP and RBP have
	            dedicated functionality dictated by the hardware and
	            calling conventions.|)
	  (item #:|11 "node" registers.

	            All other registers (RBX, RSI, RDI, and R8-R15)
	            are asserted to contain node values at (almost) all
	            times; legacy "string" operations that implicitly use RSI
	            and/or RDI are not used.|))
	(para "
		On 32-bit x86, the default register partitioning scheme
		involves:
	      ")
	(listing :bullet
	  (item #:|
		  A single "immediate" register.



		    The EAX register is given the symbolic name
		    "imm0".
		  |)
	  (item #:|
		    There are two "dedicated" registers.



		    ESP and EBP have dedicated functionality dictated by the
		    hardware and calling conventions.
		  |)
	  (item #:|
		    5 "node" registers.



		    The remaining registers, (EBX, ECX, EDX, ESI, EDI) normally
		    contain node values.  As on x86-64, string instructions
		    that implicitly use ESI and EDI are not used.
		  |))
	"
		There are times when this default partitioning scheme is
		inadequate.  As mentioned in the x86-64 section, there are
		instructions like the extended-precision MUL and DIV which
		require the use of EAX and EDX.  We therefore need a way to
		change this partitioning at run-time.



		Two schemes are employed.  The first uses a mask in the TCR
		that contains a bit for each register.  If the bit is set,
		the register is interpreted by the GC as a node register; if it's
		clear, the register is treated as an immediate register.  The
		second scheme uses the direction flag in the EFLAGS register.
		If DF is set, EDX is treated as an immediate register.
		(We don't use the string instructions, so DF isn't otherwise
		used.)


            On the PPC, the static register partitioning scheme
            involves:"
	(listing :bullet
	  (item #:|6 "immediate" registers.

	            Registers r3-r8 are given
	            the symbolic names imm0-imm5.  As a RISC architecture
	            with simpler addressing modes, the PPC probably
	            uses immediate registers a bit more often than the CISC
	            x86-64 does, but they're generally used for the same sort
	            of things (type checking, array indexing, FFI,
	            etc.)|)
	  (item "9 dedicated registers
	            "
	    (listing :bullet
	      (item "r0 (symbolic name rzero) always contains the
		              value 0 when running lisp code.  Its value is
		              sometimes read as 0 when it's used as the base
		              register in a memory address; keeping the value 0
		              there is sometimes convenient and avoids
		              asymmetry.")
	      (item "r1 (symbolic name sp) is the control stack
		              pointer, by PPC convention.")
	      (item "r2 is used to hold the current thread's TCR on
		              ppc64 systems; it's not used on ppc32.")
	      (item "r9 and r10 (symbolic names allocptr and
		              allocbase) are used to do per-thread memory
		              allocation")
	      (item "r11 (symbolic name nargs) contains the number
		              of function arguments on entry and the number of
		              return values in multiple-value returning
		              constructs.  It's not used more generally as either
		              a node or immediate register because of the way that
		              certain trap instruction encodings are
		              interpreted.")
	      (item "r12 (symbolic name tsp) holds the top of the
		              current thread's temp stack.")
	      (item "r13 is used to hold the TCR on PPC32 systems;
		              it's not used on PPC64.")
	      (item #:|r14 (symbolic name loc-pc) is used to copy
		              "pc-locative" values between main memory and
		              special-purpose PPC registers (LR and CTR) used in
		              function-call and return instructions.|)
	      (item "r15 (symbolic name vsp) addresses the top of
		              the current thread's value stack.")
	      (item #:|lr and ctr are PPC branch-unit registers used
		              in function call and return instructions; they're
		              always treated as "pc-locatives", which precludes
		              the use of the ctr in some PPC looping
		              constructs.|)))
	  (item #:|17 "node" registers

	            r15-r31 are always treated as node
	            registers|))))
    (defsection "Tagging scheme"
      #:|{CCL} always allocates lisp objects on double-node
          (64-bit for 32-bit platforms, 128-bit for 64-bit platforms)
          boundaries; this mean that the low 3 bits (32-bit lisp) or 4
          bits (64-bit lisp) are always 0 and are therefore redundant
          (we only really need to know the upper 29 or 60 bits in order
          to identify the aligned object address.)  The extra bits in a
          lisp node can be used to encode at least some information
          about the node's type, and the other 29/60 bits represent
          either an immediate value or a doublenode-aligned memory
          address.  The low 3 or 4 bits of a node are called the node's
          "tag bits", and the conventions used to encode type
          information in those tag bits are called a "tagging
          scheme."

          It might be possible to use the same tagging scheme on
          all platforms (at least on all platforms with the same word
          size and/or the same number of available tag bits), but there
          are often some strong reasons for not doing so.  These
          arguments tend to be very machine-specific: sometimes, there
          are fairly obvious machine-dependent tricks that can be
          exploited to make common operations on some types of tagged
          objects faster; other times, there are architectural
          restrictions that make it impractical to use certain tags for
          certain types.  (On PPC64, the "ld" (load doubleword) and
          "std" (store doubleword) instructions - which load and store a
          GPR operand at the effective address formed by adding the
          value of another GPR operand and a 16-bit constant operand -
          require that the low two bits of that constant operand be 0.
          Since such instructions would typically be used to access the
          fields of things like CONS cells and structures, it's
          desirable that that the tags chosen for CONS cells and
          structures allow the use of these instructions as opposed to
          more expensive alternatives.)

          One architecture-dependent tagging trick that works well
          on all architectures is to use a tag of 0 for FIXNUMs: a
          fixnum basically encodes its value shifted left a few bits
          and keeps those low bits clear. FIXNUM addition,
          subtraction, and binary logical operations can operate
          directly on the node operands, addition and subtraction can
          exploit hardware-based overflow detection, and (in the
          absence of overflow) the hardware result of those operations
          is a node (fixnum).  Some other slightly-less-common
          operations may require a few extra instructions, but
          arithmetic operations on FIXNUMs should be as cheap as
          possible and using a tag of zero for FIXNUMs helps to ensure
          that it will be.

	      If we have N available tag bits (N = 3 for 32-bit {CCL}
	      and N = 4 for 64-bit {CCL}), this way of representing
	      fixnums with the low M bits forced to 0 works as long as M
	      <= N.  The smaller we make M, the larger the values of
	      MOST-POSITIVE-FIXNUM and MOST-NEGATIVE become; the larger we
	      make N, the more distinct non-FIXNUM tags become available.
	      A reasonable compromise is to choose M = N-1; this basically
	      yields two distinct FIXNUM tags (one for even fixnums, one
	      for odd fixnums), gives 30-bit fixnums on 32-bit platforms
	      and 61-bit fixnums on 64-bit platforms, and leaves us with 6
	      or 14 tags to encoded other types.

          Once we get past the assignment of FIXNUM tags, things
          quickly devolve into machine-dependencies.  We can fairly
          easily see that we can't directly tag all other primitive
          lisp object types with only 6 or 14 available tag values;
          the details of how types are encoded vary between the ppc32,
          ppc64, and x86-64 implementations, but there are some
          general common principles:|
      (listing :bullet
	(item #:|CONS cells always contain exactly 2 elements and are
	          usually fairly common.It therefore makes sense to give
	          CONS cells their own tag.  Unlike the fixnum case -
	          where a tag value of 0 had positive implications - there
	          doesn't seem to be any advantage to using any particular
	          value.  (A longtime ago - in the case of 68K MCL - the
	          CONS tag and the order of CAR and CDR in memory were
	          chosen to allow smaller, cheaper addressing modes to be
	          used to "cdr down a list."  That's not a factor on ppc
	          or x86-64, but all versions of {CCL} still store the CDR
	          of a CONS cell first in memory.  It doesn't matter, but
	          doing it the way that the host system did made
	          bootstrapping to a new target system a little easier.)
	        |)
	(item #:|Any way you look at it, NIL is a bit
	          ... unusual. NIL is both a SYMBOL and a LIST (as well as
	          being a canonical truth value and probably a few other
	          things.)  Its role as a LIST is probably much more
	          important to most programs than its role as a SYMBOL is:
	          LISTP has to be true of NIL and primitives like CAR and
	          CDR do LISTP implicitly when safe and want that
	          operation to be fast. There are several possible
	          approaches to this problem; {CCL} uses two of them. On
	          PPC32 and X86-64, NIL is basically a weird CONS cell
	          that straddles two doublenodes; the tag of NIL is unique
	          and congruent modulo 4 (modulo 8 on 64-bit) with the tag
	          used for CONS cells.  LISTP is therefore true of any
	          node whose low 2 (or 3) bits contain the appropriate tag
	          value (it's not otherwise necessary to special-case
	          NIL.)  SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE,
	          SYMBOL-PLIST ..) -do- have to special-case NIL (and
	          access the components of an internal proxy symbol.) On
	          PPC64 (where architectural restrictions dictate the set
	          of tags that can be used to access fixed components of
	          an object), that approach wasn't practical.  NIL is just
	          a distinguished SYMBOL,and it just happens to be the
	          case that its pname slot and values slot are at the same
	          offsets from a tagged pointer as a CONS cell's CDR and
	          CAR would be.  NIL's pname is set to NIL (SYMBOL-NAME
	          checks for this and returns the string "NIL"), and LISTP
	          (and therefore safe CAR and CDR) has to check for (OR
	          NULL CONSP). At least in the case of CAR and CDR, the
	          fact that the PPC has multiple condition-code fields
	          keeps that extra test from being prohibitively
	          expensive.  On IA-32, we can't afford to dedicate a tag to
		  NIL. NIL is therefore just a distinguished CONS
		  cell, and we have to explicitly check for a NIL argument
		  in CONSP/RPLACA/RPLACD.
		|)
	(item "Some objects are immediate (but not FIXNUMs). This
	          is true of CHARACTERs and, on 64-bit platforms,
	          SINGLE-FLOATs. It's also true of some nodes used in the
	          runtime system (special values used to indicate unbound
	          variables and slots, for instance.) On 64-bit platforms,
	          SINGLE-FLOATs have their own unique tag (making them a
	          little easier to recognize; on all platforms, CHARACTERs
	          share a tag with other immediate objects (unbound
	          markers) but are easy to recognize (by looking at
	          several of their low bits.)  The GC treats any node with
	          an immediate tag (and any node with a fixnum tag) as a
	          leaf.")
	(item #:|There are some advantages to treating everything
	          else-memory-allocated objects that aren't CONS
	          cells-uniformly.There are some disadvantages to
	          that uniform treatment as well, and the treatment of
	          "memory-allocated non-CONS objects" isn't entirely
	          uniform across all {CCL} implementations.  Let's first
	          pretend that the treatment is uniform, then discuss the
	          ways in which it isn't.The "uniform approach" is to
	          treat all memory-allocated non-CONS objects as if they
	          were vectors; this use of the term is a little looser
	          than what's implied by the CL VECTOR type.  {CCL}
	          actually uses the term "uvector" to mean "a
	          memory-allocated lisp object other than a CONS cell,
	          whose first word is a header that describes the object's
	          type and the number of elements that it contains."  In
	          this view, a SYMBOL is a UVECTOR, as is a STRING, a
	          STANDARD-INSTANCE, a CL array or vector, a FUNCTION, and
	          even a DOUBLE-FLOAT. In the PPC implementations (where
	          things are a little more ... uniform), a single tag
	          value is used to denote any uvector; in order to
	          determine something more specific about the type of the
	          object in question, it's necessary to fetch the low byte
	          of the header word from memory.  On the x86-64 platform,
	          certain types of uvectors - SYMBOLs and FUNCTIONs -are
	          given their own unique tags.  The good news about the
	          x86-64 approach is that SYMBOLs and FUNCTIONs can be
	          recognized without referencing memory; the slightly bad
	          news is that primitive operations that work on
	          UVECTOR-tagged objects-like the function
	          CCL:UVREF-don't work on SYMBOLs or FUNCTIONs on
	          x86-64 (but -do- work on those types of objects in the
	          PPC ports.) The header word that precedes a UVECTOR's
	          data in memory contains 8 bits of type information in
	          the low byte and either 24 or 56 bits of "element-count"
	          information in the rest of the word.  (This is where the
	          sometimes-limiting value of 2^24 for
	          ARRAY-TOTAL-SIZE-LIMIT on 32-bit platforms comes from.)
	          The low byte of the header-sometimes called the
	          uvector's subtag-is itself tagged (which means
	          that the header is tagged.)  The (3 or 4) tag bits in
	          the subtag are used to determine whether the uvector's
	          elements are nodes or immediates. (A UVECTOR whose
	          elements are nodes is called a GVECTOR; a UVECTOR whose
	          elements are immediates is called an IVECTOR.  This
	          terminology came from Spice Lisp, which was a
	          predecessor of CMUCL.)  Even though a uvector header is
	          tagged, a header is not a node.  There's no (supported)
	          way to get your hands on one in lisp and doing so could
	          be dangerous.  (If the value of a header wound up in a
	          lisp node register and that register wound up getting
	          pushed on a thread's value stack, the GC might
	          misinterpret that situation to mean that there was a
	          stack-allocated UVECTOR on the value stack.)|))))
  (defsection "Heap Allocation"
    #:|When the {CCL} kernel first
        starts up, a large contiguous chunk of the process's address
        space is mapped as "anonymous, no access" memory. ("Large"
        means different things in different contexts; on LinuxPPC32,
        it means "about 1 gigabyte", on DarwinPPC32, it means "about 2
        gigabytes", and on current 64-bit platforms it ranges from 128
        to 512 gigabytes, depending on OS. These values are both
        defaults and upper limits;
        the {code --heap-reserve} argument can be used to
        try to reserve less than the default.)

        Reserving address space that can't (yet) be read or
        written to doesn't cost much; in particular, it doesn't require
        that corresponding swap space or physical memory be available.
        Marking the address range as being "mapped" helps to ensure that
        other things (results from random calls to malloc(), dynamically
        loaded shared libraries) won't be allocated in this region that
        lisp has reserved for its own heap growth.

        A small portion (around 1/32 on 32-bit platforms and 1/64
        on 64-bit platforms) of that large chunk of address space is
        reserved for GC data structures.  Memory pages reserved for
        these data structures are mapped read-write as pages are made
        writable in the main portion of the heap.

        The initial heap image is mapped into this reserved
        address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes
        are mapped read-write.  GC data structures grow to match the
        amount of GC-able memory in the initial image plus the gc
        threshold, and control is transferred to lisp code.
        Inevitably, that code spoils everything and starts consing;
        there are basically three layers of memory allocation that can
        go on.|
    (defsection "Per-thread object allocation"
      #:|Each lisp thread has a private "reserved memory
          segment"; when a thread starts up, its reserved memory segment
          is empty.  PPC ports maintain the highest unallocated address
          and the lowest allocatable address in the current segment in
          registers when running lisp code; on x86-664, these values are
          maintained in the current threads's TCR.  (An "empty" heap
          segment is one whose high pointer and low pointer are equal.)
          When a thread is not in the middle of allocating something, the
          low 3 or 4 bits of the high and low pointers are clear (the
          pointers are doublenode-aligned.)

          A thread tries to allocate an object whose physical size
          in bytes is X and whose tag is Y by:|
      (listing :number
	(item #:|decrementing the "high" pointer by (- X Y)|)
	(item "trapping if the high pointer is less than the low
	          pointer")
	(item "using the (tagged) high pointer to initialize the
	          object, if necessary")
	(item "clearing the low bits of the high pointer"))
      (para "On PPC32, where the size of a CONS cell is 8 bytes and
          the tag of a CONS cell is 1, machine code which sets the arg_z
          register to the result of doing (CONS arg_y arg_z) looks
          like:")
      (code-block "
  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
	    ")
      (para "On x86-64, the idea's similar but the implementation is
          different.  The high and low pointers to the current thread's
          reserved segment are kept in the TCR, which is addressed by
          the gs segment register. An x86-64 CONS cell is 16 bytes wide
          and has a tag of 3; we canonically use the temp0 register to
          initialize the object")
      (code-block "
  (subq ($ 13) ((% gs) 216))    ; decrement allocptr
  (movq ((% gs) 216) (% temp0)) ; load allocptr into temp0
  (cmpq ((% gs) 224) (% temp0)) ; compare to allocabase
  (jg L1)                       ; skip trap
  (uuo-alloc)                   ; uh, don't skip trap
L1
  (andb ($ 240) ((% gs) 216))   ; untag allocptr in the tcr
  (movq (% arg_y) (5 (% temp0))) ; set the car
  (movq (% arg_z) (-3 (% temp0))); set the cdr
  (movq (% temp0) (% arg_z))    ; return the cons
	    ")
      (para "If we don't take the trap (if allocating 8-16 bytes
          doesn't exhaust the thread's reserved memory segment), that's
          a fairly short and simple instruction sequence.  If we do take
          the trap, we'll have to do some additional work in order to
          get a new segment for the current thread."))
    (defsection "Allocation of reserved heap segments"
      #:|After the lisp image is first mapped into memory - and after
          each full GC - the lisp kernel ensures that
          (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current
          end of the heap are mapped read-write.

          If a thread traps while trying to allocate memory, the
          thread goes through the usual exception-handling protocol (to
          ensure that any other thread that GCs "sees" the state of the
          trapping thread and to serialize exception handling.)  When
          the exception handler runs, it determines the nature and size
          of the failed allocation and tries to complete the allocation
          on the thread's behalf (and leave it with a reasonably large
          thread-specific memory segment so that the next small
          allocation is unlikely to trap.

          Depending on the size of the requested segment
          allocation, the number of segment allocations that have
          occurred since the last GC, and the EGC and GC thresholds, the
          segment allocation trap handler may invoke a full or ephemeral
          GC before returning a new segment.  It's worth noting that the
          [E]GC is triggered based on the number of and size of these
          segments that have been allocated since the last GC; it doesn't
          have much to do with how "full" each of those per-thread
          segments are.  It's possible for a large number of threads to
          do fairly incidental memory allocation and trigger the GC as a
          result; avoiding this involves tuning the per-thread
          allocation quantum and the GC/EGC thresholds
          appropriately.|)
    (defsection "Heap growth"
      #:|All OSes on which {CCL} currently runs use an
          "overcommit" memory allocation strategy by default (though
          some of them provide ways of overriding that default.)  What
          this means in general is that the OS doesn't necessarily
          ensure that backing store is available when asked to map pages
          as read-write; it'll often return a success indicator from the
          mapping attempt (mapping the pages as "zero-fill,
          copy-on-write"), and only try to allocate the backing store
          (swap space and/or physical memory) when non-zero contents are
          written to the pages.

          It -sounds- like it'd be better to have the mmap() call
          fail immediately, but it's actually a complicated issue.
          (It's possible that other applications will stop using some
          backing store before lisp code actually touches the pages that
          need it, for instance.)  It's also not guaranteed that lisp
          code would be able to "cleanly" signal an out-of-memory
          condition if lisp is ... out of memory

	      I don't know that I've ever seen an abrupt out-of-memory
	      failure that wasn't preceded by several minutes of excessive
	      paging activity.  The most expedient course in cases like this
	      is to either (a) use less memory or (b) get more memory; it's
	      generally hard to use memory that you don't have.|))
  (defsection "GC details"
    "The GC uses a Mark/Compact algorithm; its
        execution time is essentially a factor of the amount of live
        data in the heap. (The somewhat better-known Mark/Sweep
        algorithms don't compact the live data but instead traverse the
        garbage to rebuild free-lists; their execution time is therefore
        a factor of the total heap size.)

        As mentioned in {section Heap Allocation}, two
        auxiliary data structures (proportional to the size of the lisp
        heap) are maintained. These are"
    (listing :number
      (item "the markbits bitvector, which contains a bit for
	        every doublenode in the dynamic heap (plus a few extra words
	        for alignment and so that sub-bitvectors can start on word
	        boundaries.)")
      (item "the relocation table, which contains a native word for
	        every 32 or 64 doublenodes in the dynamic heap, plus an
	        extra word used to keep track of the end of the heap."))
    "The total GC space overhead is therefore on the order of
        3% (2/64 or 1/32).

The general algorithm proceeds as follows:"
    (defsection "Mark phase"
      "Each doublenode in the dynamic heap has a corresponding
          bit in the markbits vector. (For any doublenode in the heap,
          the index of its mark bit is determined by subtracting the
          address of the start of the heap from the address of the
          object and dividing the result by 8 or 16.) The GC knows the
          markbit index of the free pointer, so determining that the
          markbit index of a doubleword address is between the start of
          the heap and the free pointer can be done with a single
          unsigned comparison.

          The markbits of all doublenodes in the dynamic heap are
          zeroed before the mark phase begins. An object is
          {emphasis marked} if the markbits of all of its
          constituent doublewords are set and unmarked otherwise;
          setting an object's markbits involves setting the corresponding
          markbits of all constituent doublenodes in the object.

          The mark phase traverses each root. If the tag of the
          value of the root indicates that it's a non-immediate node
          whose address lies in the lisp heap, then:"
      (listing :number
	(item "If the object is already marked, do nothing.")
	(item "Set the object's markbit(s).")
	(item "If the object is an ivector, do nothing further.")
	(item "If the object is a cons cell, recursively mark its
	          car and cdr.")
	(item "Otherwise, the object is a gvector. Recursively mark
	          its elements."))
      "Marking an object thus involves ensuring that its mark
          bits are set and then recursively marking any pointers
          contained within the object if the object was originally
          unmarked. If this recursive step was implemented in the
          obvious manner, marking an object would take stack space
          proportional to the length of the pointer chain from some root
          to that object. Rather than storing that pointer chain
          implicitly on the stack (in a series of recursive calls to the
          mark subroutine), the {CCL} marker uses mixture of recursion
          and a technique called {emphasis link inversion} to
          store the pointer chain in the objects themselves.  (Recursion
          tends to be simpler and faster; if a recursive step notes that
          stack space is becoming limited, the link-inversion technique
          is used.)

Certain types of objects are treated a little specially:"
      (listing :number
	(item #:|To support a feature called {emphasis GCTWA}
                (an acronym that I believe comes from MACLISP,
		            where it stood for "Garbage Collection of Truly
		            Worthless Atoms"),
                   the vector that contains the internal
	          symbols of the current package is marked on entry to the
	          mark phase, but the symbols themselves are not marked at
	          this time. Near the end of the mark phase, symbols
	          referenced from this vector which are not otherwise
	          marked are marked if and only if they're somehow
	          distinguishable from newly created symbols (by virtue of
	          their having function bindings, value bindings, plists,
	          or other attributes.)|)
	(item "Pools have their first element set to NIL before any
	          other elements are marked.")
	(item "All hash tables have certain fields (used to cache
	          previous results) invalidated.")
	(item "Weak Hash Tables and other weak objects are put on a
	          linkedlist as they're encountered; their contents are only
	          retained if there are other (non-weak) references to
	          them."))
      (para "At the end of the mark phase, the markbits of all
          objects that are transitively reachable from the roots are
          set and all other markbits are clear."))
    (defsection "Relocation phase"
      "The {emphasis forwarding address} of a
	      doublenode in the dynamic heap is (<its current address> -
	      (size_of_doublenode * <the number of unmarked markbits that
	      precede it>)) or alternately (<the base of the heap> +
	      (size_of_doublenode * <the number of marked markbits that
	      precede it >)). Rather than count the number of preceding
	      markbits each time, the relocation table is used to precompute
	      an approximation of the forwarding addresses for all
	      doublewords. Given this approximate address and a pointer into
	      the markbits vector, it's relatively easy to compute the exact
	      forwarding address.

	      The relocation table contains the forwarding addresses
	      of each {emphasis pagelet}, where a pagelet is 256
	      bytes (or 32 doublenodes). The forwarding address of the first
	      pagelet is the base of the heap. The forwarding address of the
	      second pagelet is the sum of the forwarding address of the
	      first and 8 bytes for each mark bit set in the first 32-bit
	      word in the markbits table. The last entry in the relocation
	      table contains the forwarding address that the freepointer
	      would have, e.g., the new value of the freepointer after
	      compaction.

	      In many programs, old objects rarely become garbage and
	      new objects often do. When building the relocation table, the
	      relocation phase notes the address of the first unmarked
	      object in the dynamic heap. Only the area of the heap between
	      the first unmarked object and the freepointer needs to be
	      compacted; only pointers to this area will need to be
	      forwarded (the forwarding address of all other pointers to the
	      dynamic heap is the address of that pointer.)  Often, the
	      first unmarked object is much nearer the free pointer than it
	      is to the base of the heap.")
    (defsection "Forwarding phase"
      #:|The forwarding phase traverses all roots and the "old"
          part of the dynamic heap (the part between the base of the
          heap and the first unmarked object.) All references to objects
          whose address is between the first unmarked object and the
          free pointer are updated to point to the address the object
          will have after compaction by using the relocation table and
          the markbits vector and interpolating.

	      The relocation table entry for the pagelet nearest the
	      object is found. If the pagelet's address is less than the
	      object's address, the number of set markbits that precede
	      the object on the pagelet is used to determine the object's
	      address; otherwise, the number of set markbits that follow
	      the object on the pagelet is used.

          Since forwarding views the heap as a set of doublewords,
          locatives are (mostly) treated like any other pointers. (The
          basic difference is that locatives may appear to be tagged as
          fixnums, in which case they're treated as word-aligned
          pointers into the object.)

          If the forward phase changes the address of any hash
          table key in a hash table that hashes by address (e.g., an EQ
          hash table), it sets a bit in the hash table's header. The
          hash table code will rehash the hash table's contents if it
          tries to do a lookup on a key in such a table.

          Profiling reveals that about half of the total time
          spent in the GC is spent in the subroutine which determines a
          pointer's forwarding address. Exploiting GCC-specific idioms,
          hand-coding the routine, and inlining calls to it could all be
          expected to improve GC performance.|)
    (defsection "Compact phase"
      #:|The compact phase compacts the area between the first
          unmarked object and the freepointer so that it contains only
          marked objects.  While doing so, it forwards any pointers it
          finds in the objects it copies.

          When the compact phase is finished, so is the GC (more
          or less): the free pointer and some other data structures are
          updated and control returns to the exception handler that
          invoked the GC. If sufficient memory has been freed to satisfy
          any allocation request that may have triggered the GC, the
          exception handler returns; otherwise, a "seriously low on
          memory" condition is signaled, possibly after releasing a
          small emergency pool of memory.|))
  (defsection "The ephemeral GC"
    "In the {CCL} memory management scheme, the relative age
        of two objects in the dynamic heap can be determined by their
        addresses: if addresses X and Y are both addresses in the
        dynamic heap, X is younger than Y (X was created more recently
        than Y) if it is nearer to the free pointer (and farther from
        the base of the heap) than Y.

        Ephemeral (or generational) garbage collectors attempt to
        exploit the following assumptions:"
    (listing :bullet
      (item "most newly created objects become garbage soon after
	        they'recreated.")
      (item "most objects that have already survived several GCs
	        are unlikely to ever become garbage.")
      (item "old objects can only point to newer objects as the
	        result of a destructive modification (e.g., via
	        SETF.)"))
    #:|By concentrating its efforts on (frequently and quickly)
        reclaiming newly created garbage, an ephemeral collector hopes
        to postpone the more costly full GC as long as possible. It's
        important to note that most programs create some long-lived
        garbage, so an EGC can't typically eliminate the need for full
        GC.

        An EGC views each object in the heap as belonging to
        exactly one {emphasis generation}; generations are
        sets of objects that are related to each other by age: some
        generation is the youngest, some the oldest, and there's an age
        relationship between any intervening generations. Objects are
        typically assigned to the youngest generation when first
        allocated; any object that has survived some number of GCs in
        its current generation is promoted (or
        {emphasis tenured}) into an older generation.

        When a generation is GCed, the roots consist of the
        stacks, registers, and global variables as always and also of
        any pointers to objects in that generation from other
        generations. To avoid the need to scan those (often large) other
        generations looking for such intergenerational references, the
        runtime system must note all such intergenerational references
        at the point where they're created (via Setf). (This is
            sometimes called "The Write Barrier": all assignments which
            might result in intergenerational references must be noted, as
            if the other generations were write-protected). The
        set of pointers that may contain intergenerational references is
        sometimes called {emphasis the remembered set}.

        In {CCL}'s EGC, the heap is organized exactly the same
        as otherwise; "generations" are merely structures which contain
        pointers to regions of the heap (which is already ordered by
        age.) When a generation needs to be GCed, any younger generation
        is incorporated into it; all objects which survive a GC of a
        given generation are promoted into the next older
        generation. The only intergenerational references that can exist
        are therefore those where an old object is modified to contain a
        pointer to a new object.

        The EGC uses exactly the same code as the full GC. When a
        given GC is "ephemeral",|
    (listing :bullet
      (item #:|the "base of the heap" used to determine an object's
	        markbit address is the base of the generation
	        being collected;|)
      (item "the markbits vector is actually a pointer into the
	        middle of the global markbits table; preceding entries in
	        this table are used to note doubleword addresses in older
	        generations that (may) contain intergenerational
	        references;")
      (item "some steps (notably GCTWA and the handling of weak
	        objects) are not performed;")
      (item #:|the intergenerational references table is used to
	        find additional roots for the mark and forward phases. If a
	        bit is set in the intergenerational references table, that
	        means that the corresponding doubleword (in some "old"
	        generation, in some "earlier" part of the heap) may have had
	        a pointer to an object in a younger generation stored into
	        it.|))
    "With one exception (the implicit setfs that occur on entry
        to and exit from the binding of a special variable), all setfs
        that might introduce an intergenerational reference must be
        memoized.
        Note that the implicit setfs that occur when
        initializing an object - as in the case of a call to cons or
        vector - can't introduce intergenerational references, since
        the newly created object is always younger than the objects
        used to initialize it. It's always safe to
        push any cons cell or gvector locative onto the memo stack;
        it's never safe to push anything else.


        Typically, the intergenerational references bitvector is
        sparse: a relatively small number of old locations are stored
        into, although some of them may have been stored into many
        times. The routine that scans the memoization buffer does a lot
        of work and usually does it fairly often; it uses a simple,
        brute-force method but might run faster if it was smarter about
        recognizing addresses that it'd already seen.


        When the EGC mark and forward phases scan the
        intergenerational reference bits, they can clear any bits that
        denote doublewords that definitely do not contain
        intergenerational references.
      ")
  (defsection "Fasl files"
    #:|Saving and loading of Fasl files is implemented in
        xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp.
        The information here is only an overview, which might help when
        reading the source.

        The {CCL} Fasl format is forked from the old MCL Fasl
        format; there are a few differences, but they are minor.  The
        name "nfasload" comes from the fact that this is the so-called
        "new" Fasl system, which was true in 1986 or so.

        A Fasl file begins with a "file header", which contains
        version information and a count of the following "blocks".
        There's typically only one "block" per Fasl file.  The blocks
        are part of a mechanism for combining multiple logical files
        into a single physical file, in order to simplify the
        distribution of precompiled programs.

        Each block begins with a header for itself, which just
        describes the size of the data that follows.

        The data in each block is treated as a simple stream of
        bytes, which define a bytecode program.  The actual bytecodes,
        "fasl operators", are defined in xdump/faslenv.lisp.  The
        descriptions in the source file are terse, but, according to
        Gary, "probably accurate".

        Some of the operators are used to create a per-block
        "object table", which is a vector used to keep track of
        previously-loaded objects and simplify references to them.  When
        the table is created, an index associated with it is set to
        zero; this is analogous to an array fill-pointer, and allows the
        table to be treated like a stack.

        The low seven bits of each bytecode are used to specify
        the fasl operator; currently, about fifty operators are defined.
        The high byte, when set, indicates that the result of the
        operation should be pushed onto the object table.

        Most bytecodes are followed by operands; the operand data
        is byte-aligned.  How many operands there are, and their type,
        depend on the bytecode.  Operands can be indices into the object
        table, immediate values, or some combination of these.

        An exception is the bytecode #xFF, which has the symbolic
        name ccl::$faslend; it is used to mark the end of the
        block.|)
  (defsection "The Objective-C Bridge"
    (defsection "How {CCL} Recognizes Objective-C Objects"
      #:|In most cases, pointers to instances of Objective-C
          classes are recognized as such; the recognition is (and
          probably always will be) slightly heuristic. Basically, any
          pointer that passes basic sanity checks and whose first word
          is a pointer to a known ObjC class is considered to be an
          instance of that class; the Objective-C runtime system would
          reach the same conclusion.

          It's certainly possible that a random pointer to an
          arbitrary memory address could look enough like an ObjC
          instance to fool the lisp runtime system, and it's possible
          that pointers could have their contents change so that
          something that had either been a true ObjC instance (or had
          looked a lot like one) is changed (possibly by virtue of
          having been deallocated.)

          In the first case, we can improve the heuristics
          substantially: we can make stronger assertions that a
          particular pointer is really "of type :ID" when it's a
          parameter to a function declared to take such a pointer as an
          argument or a similarly declared function result; we can be
          more confident of something we obtained via SLOT-VALUE of a
          slot defined to be of type :ID than if we just dug a pointer
          out of memory somewhere.

          The second case is a little more subtle: ObjC memory
          management is based on a reference-counting scheme, and it's
          possible for an object to ... cease to be an object while lisp
          is still referencing it.  If we don't want to deal with this
          possibility (and we don't), we'll basically have to ensure
          that the object is not deallocated while lisp is still
          thinking of it as a first-class object. There's some support
          for this in the case of objects created with MAKE-INSTANCE,
          but we may need to give similar treatment to foreign objects
          that are introduced to the lisp runtime in other ways (as
          function arguments, return values, SLOT-VALUE results, etc. as
          well as those instances that are created under lisp
          control.)

          This doesn't all work yet (in fact, not much of it works
          yet); in practice, this has not yet been as much of a problem
          as anticipated, but that may be because existing Cocoa code
          deals primarily with relatively long-lived objects such as
          windows, views, menus, etc.|)
    (defsection "Recommended Reading"
      (listing :definition
	(item "{link http://developer.apple.com/documentation/Cocoa/ Cocoa Documentation}" ccldoc::=> "
	            This is the top page for all of Apple's documentation on
	            Cocoa.  If you are unfamiliar with Cocoa, it is a good
	            place to start.
	          ")
	(item
	  "{link http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/index.html Foundation Reference for Objective-C}"
	  ccldoc::=> "
	            This is one of the two most important Cocoa references; it
	            covers all of the basics, except for GUI programming.  This is
	            a reference, not a tutorial.
	          ")))))
